This dataset includes a record of the date, time, location, depth, magnitude, and source of every earthquake with a reported magnitude \(5.5\) or higher from \(1965-2016\).
In this analysis, we aim to gain a basic understanding of our data through iterative exploration to hopefully give us insight into where, when and the impact of earthquakes on a global scale. These patterns could help us determine the most vulnerable locations where further considerations and actions can be taken to ensure the safety of people and also the impact to the environment, including the threat of tsunamis.
Load libraries
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## Attaching package: 'numform'
## The following object is masked from 'package:dplyr':
##
## collapse
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble 3.0.3 ✓ purrr 0.3.4
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x forcats::as_factor() masks numform::as_factor()
## x lubridate::as.difftime() masks base::as.difftime()
## x numform::collapse() masks dplyr::collapse()
## x dplyr::combine() masks gridExtra::combine()
## x lubridate::date() masks base::date()
## x dplyr::filter() masks stats::filter()
## x lubridate::intersect() masks base::intersect()
## x dplyr::lag() masks stats::lag()
## x lubridate::setdiff() masks base::setdiff()
## x lubridate::union() masks base::union()
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
##
## Attaching package: 'hms'
## The following object is masked from 'package:lubridate':
##
## hms
## Linking to GEOS 3.8.1, GDAL 3.1.1, PROJ 6.3.1
Set up a consistent plot theme
# from model project.
global_theme <- theme_few() + # Theme based on S. Few's "Practical Rules for Using Color in Charts"
theme(plot.title = element_text(color = "darkred")) +
theme(strip.text.x = element_text(size = 14, colour = "#202020")) +
theme(plot.margin=margin(10,30,10,30))
global_map_theme <- global_theme +
theme(panel.grid = element_blank(),
panel.border = element_blank(),
legend.position="top", legend.direction = 'horizontal',
axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
Load data. Upon an initial scan of the dataset and referring to data dictionary, it is evident that some missing values are represented as empty strings. We will convert them to NAs and infer their meaning, if any, through the dictionary.
data_path <- './data/earthquakes.csv'
earthquakes <- read.csv(data_path, na = "")
Let us check the structure of our data.
## 'data.frame': 23412 obs. of 21 variables:
## $ Date : chr "01/02/1965" "01/04/1965" "01/05/1965" "01/08/1965" ...
## $ Time : chr "13:44:18" "11:29:49" "18:05:58" "18:49:43" ...
## $ Latitude : num 19.25 1.86 -20.58 -59.08 11.94 ...
## $ Longitude : num 145.6 127.4 -174 -23.6 126.4 ...
## $ Type : chr "Earthquake" "Earthquake" "Earthquake" "Earthquake" ...
## $ Depth : num 132 80 20 15 15 ...
## $ Depth.Error : num NA NA NA NA NA NA NA NA NA NA ...
## $ Depth.Seismic.Stations : int NA NA NA NA NA NA NA NA NA NA ...
## $ Magnitude : num 6 5.8 6.2 5.8 5.8 6.7 5.9 6 6 5.8 ...
## $ Magnitude.Type : chr "MW" "MW" "MW" "MW" ...
## $ Magnitude.Error : num NA NA NA NA NA NA NA NA NA NA ...
## $ Magnitude.Seismic.Stations: int NA NA NA NA NA NA NA NA NA NA ...
## $ Azimuthal.Gap : num NA NA NA NA NA NA NA NA NA NA ...
## $ Horizontal.Distance : num NA NA NA NA NA NA NA NA NA NA ...
## $ Horizontal.Error : num NA NA NA NA NA NA NA NA NA NA ...
## $ Root.Mean.Square : num NA NA NA NA NA NA NA NA NA NA ...
## $ ID : chr "ISCGEM860706" "ISCGEM860737" "ISCGEM860762" "ISCGEM860856" ...
## $ Source : chr "ISCGEM" "ISCGEM" "ISCGEM" "ISCGEM" ...
## $ Location.Source : chr "ISCGEM" "ISCGEM" "ISCGEM" "ISCGEM" ...
## $ Magnitude.Source : chr "ISCGEM" "ISCGEM" "ISCGEM" "ISCGEM" ...
## $ Status : chr "Automatic" "Automatic" "Automatic" "Automatic" ...
Date and Time variables should be converted to an appropriate formatType and Status are categorical and should be coerced to factorsclass: num data seems to have many missing valuesLet us quickly assess the summary statistics of our data before solving the problems described above.
## Date Time Latitude Longitude
## Length:23412 Length:23412 Min. :-77.080 Min. :-180.00
## Class :character Class :character 1st Qu.:-18.653 1st Qu.: -76.35
## Mode :character Mode :character Median : -3.568 Median : 103.98
## Mean : 1.679 Mean : 39.64
## 3rd Qu.: 26.191 3rd Qu.: 145.03
## Max. : 86.005 Max. : 180.00
##
## Type Depth Depth.Error Depth.Seismic.Stations
## Length:23412 Min. : -1.10 Min. : 0.000 Min. : 0.0
## Class :character 1st Qu.: 14.52 1st Qu.: 1.800 1st Qu.:146.0
## Mode :character Median : 33.00 Median : 3.500 Median :255.0
## Mean : 70.77 Mean : 4.993 Mean :275.4
## 3rd Qu.: 54.00 3rd Qu.: 6.300 3rd Qu.:384.0
## Max. :700.00 Max. :91.295 Max. :934.0
## NA's :18951 NA's :16315
## Magnitude Magnitude.Type Magnitude.Error Magnitude.Seismic.Stations
## Min. :5.500 Length:23412 Min. :0.000 Min. : 0.00
## 1st Qu.:5.600 Class :character 1st Qu.:0.046 1st Qu.: 10.00
## Median :5.700 Mode :character Median :0.059 Median : 28.00
## Mean :5.883 Mean :0.072 Mean : 48.95
## 3rd Qu.:6.000 3rd Qu.:0.076 3rd Qu.: 66.00
## Max. :9.100 Max. :0.410 Max. :821.00
## NA's :23085 NA's :20848
## Azimuthal.Gap Horizontal.Distance Horizontal.Error Root.Mean.Square
## Min. : 0.00 Min. : 0.005 Min. : 0.085 Min. :0.000
## 1st Qu.: 24.10 1st Qu.: 0.969 1st Qu.: 5.300 1st Qu.:0.900
## Median : 36.00 Median : 2.320 Median : 6.700 Median :1.000
## Mean : 44.16 Mean : 3.993 Mean : 7.663 Mean :1.023
## 3rd Qu.: 54.00 3rd Qu.: 4.724 3rd Qu.: 8.100 3rd Qu.:1.130
## Max. :360.00 Max. :37.874 Max. :99.000 Max. :3.440
## NA's :16113 NA's :21808 NA's :22256 NA's :6060
## ID Source Location.Source Magnitude.Source
## Length:23412 Length:23412 Length:23412 Length:23412
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Status
## Length:23412
## Class :character
## Mode :character
##
##
##
##
class: num data has almost all of its values missing. Perhaps we should remove these variables.Depth and Magnitude are positively skewed. This makes sense as the Magnitude scale is logarithmic. We will revisit this.head(earthquakes)
We should aim to combined Date and Time into a date/time object.
# create datetime column with conversion
date_time <- paste(earthquakes$Date, earthquakes$Time)
earthquakes$Date.Time <- as.POSIXct(strptime(date_time, '%m/%d/%Y %H:%M:%S') )
## Date Time Date.Time
## Length:23412 Length:23412 Min. :1965-01-02 13:44:18
## Class :character Class :character 1st Qu.:1981-04-11 05:29:33
## Mode :character Mode :character Median :1993-11-30 20:44:13
## Mean :1993-02-18 12:17:07
## 3rd Qu.:2005-09-09 19:55:22
## Max. :2016-12-30 20:08:28
## NA's :3
We seem to have gotten \(3\) NA values from our conversion. Let’s investigate.
date_regex <- "^[0-9]{2}/[0-9]{2}/[0-9]{4}$"
bad_datetimes <- subset( earthquakes, !grepl(date_regex, Date) )
bad_datetimes
It would not be entirely bad to perform listwise deletion on these observations for the purposes of our exploration. We will have these values stored in bad_datetimes if needed for future reference.
earthquakes <- subset( earthquakes, grepl(date_regex, Date) )
paste0( "New number of observations: ", nrow(earthquakes) )
## [1] "New number of observations: 23409"
With reference to the data dictionary, we can identify and coerce the categorical data.
categorical <- c("Type", "Magnitude.Type", "Source", "Location.Source", "Magnitude.Source", "Status")
earthquakes[categorical] <- lapply(earthquakes[categorical], factor) # coerce categorical vars
colSums(is.na(earthquakes[,categorical])) # check for NAs
## Type Magnitude.Type Source Location.Source
## 0 3 0 0
## Magnitude.Source Status
## 0 0
Magnitude.Type is the only categorical variable with NAs. The data dictionary states that these correspond to another type than the categories listed. Let’s create another category, Other, for these values.
earthquakes <- mutate_if(earthquakes, is.factor, fct_explicit_na, na_level = 'OTHER')
levels(earthquakes$Magnitude.Type)
## [1] "MB" "MD" "MH" "ML" "MS" "MW" "MWB" "MWC" "MWR"
## [10] "MWW" "OTHER"
Let’s quickly change some variable names so the data is easier to work with.
earthquakes <- rename(earthquakes, `Travel.Time.RMS`=`Root.Mean.Square`, `Data.Source`=`Source`, `Location.Data.Source`=`Location.Source`, `Magnitude.Data.Source`=`Magnitude.Source`, `Seismic.Event.Type`=`Type`)
names(earthquakes)
## [1] "Date" "Time"
## [3] "Latitude" "Longitude"
## [5] "Seismic.Event.Type" "Depth"
## [7] "Depth.Error" "Depth.Seismic.Stations"
## [9] "Magnitude" "Magnitude.Type"
## [11] "Magnitude.Error" "Magnitude.Seismic.Stations"
## [13] "Azimuthal.Gap" "Horizontal.Distance"
## [15] "Horizontal.Error" "Travel.Time.RMS"
## [17] "ID" "Data.Source"
## [19] "Location.Data.Source" "Magnitude.Data.Source"
## [21] "Status" "Date.Time"
Status refers to whether the observation has been “Reviewed” by a human or generated “Automatically”. Let’s make our dataframe more descriptive by making this variable logical.
earthquakes$Status <- earthquakes$Status == "Reviewed"
earthquakes <- rename(earthquakes, `Human.Checked`=`Status`)
summary(earthquakes$Human.Checked)
## Mode FALSE TRUE
## logical 2639 20770
Let’s have a peak at the Magnitude distribution.
ggplot(earthquakes, aes(x=Magnitude, y=Depth)) + geom_point() + ggtitle("Magnitude Distibution") + ylab("Depth (Km)") + global_theme
Magnitude looks quite discrete. It would be a good idea to categorise the events according to USGS as per the below.
brks <- c(0, 3.9, 5.9, 6.9, 7.9, Inf)
earthquakes$Magnitude.Range <- cut(earthquakes$Magnitude, breaks = brks)
levels(earthquakes$Magnitude.Range) <- c("Minor", "Light", "Moderate", "Strong", "Major", "Catastrophic")
summary(earthquakes$Magnitude.Range)
## Minor Light Moderate Strong Major Catastrophic
## 0 16053 6618 698 40 0
colSums(is.na(earthquakes))
## Date Time
## 0 0
## Latitude Longitude
## 0 0
## Seismic.Event.Type Depth
## 0 0
## Depth.Error Depth.Seismic.Stations
## 18949 16313
## Magnitude Magnitude.Type
## 0 0
## Magnitude.Error Magnitude.Seismic.Stations
## 23082 20845
## Azimuthal.Gap Horizontal.Distance
## 16111 21805
## Horizontal.Error Travel.Time.RMS
## 22253 6059
## ID Data.Source
## 0 0
## Location.Data.Source Magnitude.Data.Source
## 0 0
## Human.Checked Date.Time
## 0 0
## Magnitude.Range
## 0
After another visual scan of the data, it seems that Azimuthal.Gap and Travel.Time.RMS align their non missing values quite closely. Let’s validate this.
num_rms_na <- dim(filter(earthquakes, !is.na(Azimuthal.Gap)))[1]
all_na <- filter(earthquakes, !is.na(Travel.Time.RMS), !is.na(Azimuthal.Gap))
paste0("Correlation of non missing values: ", round((dim(all_na)[1] / num_rms_na) * 100, 2), "%")
## [1] "Correlation of non missing values: 97.81%"
We have verified that almost all of the non missing values from Azimuthal.Gap have a non missing value in Travel.Time.RMS. It might be worth it to keep a smaller dataset with these values for later analysis.
small_earthquake_data <- earthquakes[!is.na(earthquakes$Azimuthal.Gap),]
tail(small_earthquake_data)
Almost all the observations in Depth.Error, Magnitude.Error, Magnitude.Seismic.Stations, Depth.Seismic.Stations, Azimuthal.Gap, Horizontal.Distance and Horizontal.Error are missing. I am going to make the judgment call and drop these variables from our dataset.
drop <- c("Depth.Error", "Magnitude.Error", "Magnitude.Seismic.Stations", "Horizontal.Distance", "Horizontal.Error", "Depth.Seismic.Stations", "Azimuthal.Gap")
earthquakes_simp <- earthquakes[,!(names(earthquakes) %in% drop)]
small_earthquake_data <- small_earthquake_data[,!(names(small_earthquake_data) %in% drop)] # same for our small earthquake set
names(earthquakes_simp)
## [1] "Date" "Time" "Latitude"
## [4] "Longitude" "Seismic.Event.Type" "Depth"
## [7] "Magnitude" "Magnitude.Type" "Travel.Time.RMS"
## [10] "ID" "Data.Source" "Location.Data.Source"
## [13] "Magnitude.Data.Source" "Human.Checked" "Date.Time"
## [16] "Magnitude.Range"
Travel.Time.RMS has only \(6059\) missing values which is relatively small. Let’s see where these missing values are relative to the year.
ggplot(earthquakes_simp, aes(x = year(Date.Time), y = length(Travel.Time.RMS))) +
geom_bar(aes(fill = is.na(Travel.Time.RMS)), stat = "identity", position="fill") +
labs(title="Missing values for Travel Time Root Mean Square by Year", y="Count", x="Year", fill="is missing") +
global_theme
From above, most missing observations are grouped together in the earlier years. Let’s perform listwise deletion on these and work with data past \(1980\).
earthquakes_simp <- earthquakes_simp[!is.na(earthquakes_simp$Travel.Time.RMS),]
small_earthquake_data <- small_earthquake_data[!is.na(small_earthquake_data$Travel.Time.RMS),] # same for our small earthquake set
paste0( "New number of observations: ", nrow(earthquakes_simp) )
## [1] "New number of observations: 17350"
Finally, the data is clean and usable. Now for the fun part.
It would first be interesting to see where seismic events historically occur to identify the most affected locations.
ggplot(data = world) +
geom_sf() +
ggtitle("Seismic Events", subtitle = "1966-2016") +
geom_point(data=earthquakes_simp, aes(x=Longitude, y=Latitude, colour = Seismic.Event.Type), alpha=0.2) +
geom_text(x=150, y=30, label="Eastern-Asia, Japan") +
geom_text(x=130, y=0, label="East Asia and the Pacific") +
geom_text(x=-100, y=0, label="Latin America and Caribbean") +
global_map_theme +
theme(legend.title = element_blank())
The above shows the seismic event locations which are overwhelmingly from earthquakes. The path of the tectonic plates are visible where the slip-on-faults that define the plate boundaries commonly results in earthquakes. The most destructive path falls along East Asia and the Pacific up to Eastern-Asia, Japan and around toward the Latin America and Caribbean region. This is historically called the Circum-Pacific seismic belt and seems to hold a dense number of earthquakes.
Let us filter these results to identify the most harmful earthquakes with a magnitude above \(6\).
danger <- c("Strong", "Major", "Catestrophic")
dangerous_events <- earthquakes_simp[ earthquakes_simp$Magnitude.Range %in% danger, ]
ggplot(data = world) +
geom_sf() +
ggtitle("Large Seismic Events", subtitle = "1966-2016") +
geom_point(data=dangerous_events, aes(x=Longitude, y=Latitude), colour = "darkred", alpha=0.2) +
global_map_theme +
theme(legend.position = "none")
The Circum-Pacific seismic belt is defiantly more noticeable now and seems to hold most of the large magnitude earthquakes. It seems the most affected areas by large magnitude earthquakes are east Asia and the pacific, Latin America and Caribbean, Japan and southern Asia.
Let’s take a closer look at these regions and try really visualise the effective magnitudes. Note that the Richter Scale is logarithmic. This means that the difference between a magnitude of \(5\) and \(6\) is \(10\times\) and the difference between a magnitude of \(5\) and \(7\) is \(100\times\).
We want to represent the magnitude by the size of our instances, i.e. the area of the circles.
We know that:
\[ A \propto r^2 \] thus we need to convert our magnitude value to a linear scale by placing to the power of \(10\). We can then square root this result so that our magnitude value is proportional to the area of a circle, i.e. the size of our instance.
east_asia <- ggplot(data = world) +
geom_sf() +
ggtitle("East Asia and the Pacific", subtitle = "large earthquake events") +
geom_point(data=dangerous_events, aes(x=Longitude, y=Latitude, size = sqrt( Magnitude^10 )), colour = "darkred", alpha=0.3) +
coord_sf(xlim = c(80, 180), ylim = c(-30, 20), expand = FALSE) +
global_map_theme +
theme(legend.position = "none")
latin_america <- ggplot(data = world) +
geom_sf() +
ggtitle("Latin America and Caribbean", subtitle = "large earthquake events") +
geom_point(data=dangerous_events, aes(x=Longitude, y=Latitude, size = sqrt( Magnitude^10 )), colour = "darkred", alpha=0.3) +
coord_sf(ylim=c(-40,30), xlim=c(-140,0), expand = FALSE) +
global_map_theme + theme(legend.position = "none")
japan <- ggplot(data = world) +
geom_sf() +
ggtitle("Eastern-Asia, Japan", subtitle = "large earthquake events") +
geom_point(data=dangerous_events, aes(x=Longitude, y=Latitude, size = sqrt( Magnitude^10 )), colour = "darkred", alpha=0.3) +
coord_sf(ylim=c(10,60), xlim=c(100, 200), expand = FALSE) +
global_map_theme +
theme(legend.position = "none")
south_asia <- ggplot(data = world) +
geom_sf() +
ggtitle("Southern Asia", subtitle = "large earthquake events") +
geom_point(data=dangerous_events, aes(x=Longitude, y=Latitude, size = sqrt( Magnitude^10 )), colour = "darkred", alpha=0.3) +
coord_sf(ylim=c(5,55), xlim=c(10,110), expand = FALSE) +
global_map_theme +
theme(legend.position = "none")
grid.arrange(east_asia, latin_america, japan, south_asia, ncol=2)
Above, we have identified the most active seismic areas in the world. But which region has the most earthquakes?
We previously identified the most active seismic areas. You can see the regions of interest below.
# filter data by important regions
asia_pacific_filter <- earthquakes_simp %>% filter(round(Longitude,digits=0) %in% c(90:150), round(Latitude,digits=0) %in% c(-35:20))
latin_america_caribbean_filter <- earthquakes_simp %>% filter(round(Longitude,digits=0) %in% c(-30:-90), round(Latitude,digits=0) %in% c(-30:25))
east_asia_japan_filter <- earthquakes_simp %>% filter(round(Longitude,digits=0) %in% c(120:150), round(Latitude,digits=0) %in% c(25:55))
southern_asia_filter <- earthquakes_simp %>% filter(round(Longitude,digits=0) %in% c(40:110), round(Latitude,digits=0) %in% c(10:30))
# add region name and combine region data into a single dataframe
asia_pacific_filter$region <- "East Asia and Pacific"
latin_america_caribbean_filter$region <- "Latin America and the Caribbean"
east_asia_japan_filter$region <- "Eastern Asia, Japan"
southern_asia_filter$region <- "Southern Asia"
region_data <- do.call("rbind", list(asia_pacific_filter, latin_america_caribbean_filter, east_asia_japan_filter, southern_asia_filter))
region_data$region <- as.factor(region_data$region) # coerce to factor
ggplot(data = world) +
geom_sf() +
ggtitle("Seismic Active Regions", subtitle = "Clusters") +
geom_point(data=region_data, aes(x=Longitude, y=Latitude, colour = region), alpha=0.2) +
global_map_theme +
theme(legend.title = element_blank())
Let us quantify some of these values to identify which region sees the most earthquakes.
counts <- ggplot(region_data) +
geom_bar(aes(x=region, y=Magnitude), stat="identity", fill="darkred") +
ggtitle("Number of earthquakes per region") +
coord_flip() +
xlab("Region") + ylab("Count") +
theme(axis.text.y=element_text(size=rel(0.8))) +
global_theme
danger_counts <- ggplot(region_data[region_data$Magnitude > 6.0,]) +
geom_bar(aes(x=region, y=Magnitude), stat="identity", fill="darkred") +
ggtitle("Number of mag > 6.0 earthquakes per region") +
coord_flip() +
theme(axis.text.y=element_text(size=rel(0.8))) +
xlab("Region") + ylab("Count") +
global_theme
grid.arrange(counts, danger_counts, nrow=2)
It is evident that the East Asia and Pacific region has by far the most earthquakes. Eastern Asia, Japan has the second greatest but actually has far less dangerous (above \(6.0\) magnitude) earthquakes than both Eastern Asia, Japan and the Latin America and the Caribbean region. This means that Eastern Asia, Japan has far more small magnitude earthquakes. The Southern Asia region falls short on both.
It could be suspicious that Eastern Asia, Japan has so many small magnitude earthquakes recorded in comparison to other regions. It should be noted that the precision of earthquake recordings depend on the equipment used and this can vary depending which network contributor recorded the results. Japan is known to have very sophisticated recording equipment which is capable of recording very small magnitude earthquakes. This potential bias may speak to why Eastern Asia, Japan has many smaller recordings. Let’s investigate this point.
ggplot(region_data[region_data$region=="Eastern Asia, Japan",]) +
geom_bar(aes(x="", y=length(Magnitude.Data.Source), fill=Magnitude.Data.Source), stat="identity") +
ggtitle("Eastern Asia, Japan Data Source") +
coord_polar("y", start=0) +
theme(axis.text.y=element_text(size=rel(0.8))) +
xlab("") + ylab("") + labs(fill = "Contributors") +
global_theme +
scale_fill_brewer(palette="RdGy")
It seems that earthquakes in this region were mostly recorded by US and HRV which are located in the United States. Let us zoom out to see if this is consistent to all regions.
ggplot(region_data) +
geom_bar(aes(x="", y=length(Magnitude.Data.Source), fill=Magnitude.Data.Source), stat="identity") +
ggtitle("Global Earthquake Data Source") +
coord_polar("y", start=0) +
theme(axis.text.y=element_text(size=rel(0.8))) +
xlab("") + ylab("") + labs(fill = "Contributors") +
global_theme +
scale_fill_brewer(palette="RdGy")
When considering all regions, mostly earthquake data was recorded, again, by US and HRV. This disproves my original hypothesis where it seems that the majority of recordings come from the United States. Perhaps it is the case that the Eastern Asia, Japan region has many small magnitude earthquakes.
We have Identified that the East Asia and Pacific region seems the most seismic active but it is also a much larger size than Eastern Asia, Japan which falls in second place. Let’s balance out this inequality and investigate the proportion of earthquakes per unit area.
First we will load in global surface area data.
# src: https://www.mapsofworld.com/thematic-maps/surface-area-map.html
area_data_path <- './data/surface_area.csv'
areas <- read.csv(area_data_path, header = FALSE)
names(areas) <- c("Country", "Surface Area (Million sq. km)")
head(areas)
Now let us plot the earthquake counts with respect to the regions area.
# sum areas of all countries within a particular region to get total
asia_pacific_countries <- c("Indonesia", "Malaysia", "Philippines", "Papua New Guinea", "Myanmar", "Thailand", "Bangladesh", "Vietnam", "Lao PDR")
asia_pacific_area <- sum(areas[areas$Country %in% asia_pacific_countries,]$`Surface Area (Million sq. km)`)
latin_america_caribbean_countries <- c("Mexico", "Guatemala", "Belize", "Honduras", "El Salvador", "Nicaragua", "Colombia", "Ecuador", "Peru", "Bolivia", "Chile")
latin_america_caribbean_area <- sum(areas[areas$Country %in% latin_america_caribbean_countries,]$`Surface Area (Million sq. km)`)
east_asia_japan_countries <- c("Japan", "Korea, Rep.", "Korea, Dem. People's Rep.")
east_asia_japan_area <- sum(areas[areas$Country %in% east_asia_japan_countries,]$`Surface Area (Million sq. km)`)
southern_asia_countries <- c("Saudi Arabia", "Yemen, Rep.", "Oman", "Iraq", "Syrian Arab Republic", "Afghanistan", "Pakistan", "India", "Kyrgyz Republic", "Kazakhstan")
southern_asia_area <- sum(areas[areas$Country %in% southern_asia_countries,]$`Surface Area (Million sq. km)`)
# add areas and combine region data into a single dataframe
asia_pacific_filter$area <- asia_pacific_area
latin_america_caribbean_filter$area <- latin_america_caribbean_area
east_asia_japan_filter$area <- east_asia_japan_area
southern_asia_filter$area <- southern_asia_area
region_data <- do.call("rbind", list(asia_pacific_filter, latin_america_caribbean_filter, east_asia_japan_filter, southern_asia_filter))
# plot results
counts <- ggplot(region_data) +
geom_bar(aes(x=region, y=Magnitude/area), stat="identity", fill="darkred") +
ggtitle("Earthquake density") +
coord_flip() +
xlab("Region") + ylab("Count/Million Km.Sqr") +
theme(axis.text.y=element_text(size=rel(0.8))) +
global_theme
danger_counts <- ggplot(region_data[region_data$Magnitude > 6.0,]) +
geom_bar(aes(x=region, y=Magnitude/area), stat="identity", fill="darkred") +
ggtitle("Mag > 6.0 earthquake desnisty") +
coord_flip() +
theme(axis.text.y=element_text(size=rel(0.8))) +
xlab("Region") + ylab("Count/Million Km.Sqr") +
global_theme
grid.arrange(counts, danger_counts, nrow=2)
Now it is evident that Eastern Asia, Japan has the largest density of earthquakes. This is also consistent for above magnitude \(6\) earthquakes.
depth1 <- ggplot(earthquakes, aes(x=Magnitude, y=Depth)) +
geom_point() +
ggtitle("Depth vs Magnitude") + xlab("") + ylab("Depth (Km)") +
geom_hline(yintercept=70, linetype="dashed", color = "darkred", size=0.5) +
global_theme
depth2 <- ggplot(earthquakes, aes(x=Magnitude.Range, y=Depth)) +
geom_point() +
xlab("Magnitude") + ylab("Depth (Km)") + geom_hline(yintercept=70, linetype="dashed", color = "darkred", size=0.5) +
geom_text(x=4.2, y=130, label="earth crust") +
global_theme
grid.arrange(depth1, depth2, nrow=2)
It seems that Major earthquakes (magnitude > \(7.0\)) mostly occur at low depths (close to the earths surface) but other magnitude earthquakes can occur at any depth. This seems a reasonable observation as the earths crust is the coldest and most brittle part of the earth. This makes it more likely for earthquakes to occur.
Let us look finer into this relationship.
ggplot(earthquakes, aes(x=Magnitude.Range, y=Depth)) +
geom_boxplot() +
xlab("Magnitude") + ylab("Depth (Km)") + geom_hline(yintercept=70, linetype="dashed", color = "darkred", size=0.5) +
geom_text(x=4.2, y=75, label="earth crust") +
global_theme +
coord_cartesian(ylim = boxplot.stats(earthquakes$Depth)$stats[c(1, 5)]*1.05)
We can see here that more than \(75\%\) of observations occur at a depth within the earths crust as the interquartile range falls below a depth of \(70\)km. This makes the statement that most high magnitude earthquakes occur at a low depth, close to the earths surface. In fact, this can be a statement about all earthquakes.
From our exploratory data analysis, we can take a number of insights, each that raise new questions that we could explore further by use of modelling, using additional data from other sources, or simply further EDA:
The raw number of earthquakes mostly occur in the East Asia and the Pacific, Latin America and Caribbean, Eastern-Asia, Japan and Southern Asia regions. This is along the Circum-Pacific seismic belt.
East Asia and Pacific has the most earthquakes but the Eastern-Asia, Japan region has the highest density of earthquakes.
Most earthquakes occur in the earths crust, less than \(70km\) deep. This is not discriminant of magnitude.
Jason Veljanoski (21980294)